Kinesis Analytics
- Querying streams of data continuously
- Can receive data from and send results to
- Kinesis Data Streams
- Kinesis Data Firehose
- Analysis using SQL
- Can use reference table from S3 to help analysis
- Errors will be output to Error Stream
Kinesis Analytics Use Cases
- Streaming ETL
- Continuous metric generation
- Responsive analytics
Schema discovery
- Generate data schema automatically by feeding some stream data
RANDOM_CUT_FOREST
- A SQL function offered by Kinesis Data Analytics
- For anomaly detection
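A sketch of how RANDOM_CUT_FOREST is typically wired into a Kinesis Data Analytics application; `SOURCE_SQL_STREAM_001` is the default in-application input stream name, while the `metric` column is a hypothetical numeric field:

```sql
-- In-application anomaly-scoring query (sketch).
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "metric"        DOUBLE,
    "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "metric", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
```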
Amazon Elasticsearch Service (ES)
- Petabyte-scale search, analysis, and reporting tools
- Fundamentally based on a search engine (Lucene)
- ES can be regarded as an analysis tool
- Has a visualization tool (Kibana)
- Can use data pipeline to send stream data to ES
- Kinesis
- Beats, LogStash, Apache Kafka
- ES API
- Horizontally scalable
ES Use Cases
- Text search
- Log analytics
- Application monitoring
- Security analytics
- Click stream analytics
ES Concepts
- Documents
- Text or JSON
- Types
- Define the schema and mapping shared by documents
- Indices
- Power search into all documents within a collection of types
- An index is split into shards
- A primary shard and replicas
- Write requests are routed to the primary shard, then replicated
- Read requests are routed to any shard
ES Features
- Fully managed but not serverless; ES runs on EC2
- Can scale up or down without downtime, but scaling must be done manually
- Network isolation (VPC)
- Can use all the features of VPC
- AWS integration
- AWS IoT
- S3 via Lambda to Kinesis
- Kinesis Data Streams
- DynamoDB Streams
ES Options
- Dedicated master node(s)
- Used only for cluster management; does not hold or process data
- Decide numbers and instance types
- Domains
- A collection of all the resources of an ES cluster
- Automatic snapshots to S3
ES Security
- Resource, identity, IP-based policies
- Request signing
- VPC
- Cognito
ES Anti-patterns
- OLTP
- RDS or DynamoDB
- Ad-hoc data querying
- Athena
Amazon Athena
- Interactive query service for S3 in SQL
- Use Presto
- Supported data formats
- CSV
- JSON
- ORC (columnar, splittable)
- Parquet (columnar, splittable)
- Avro (splittable)
Athena Integration
- Jupyter, Zeppelin, RStudio notebooks
- QuickSight
- Other visualization tools via ODBC / JDBC
Athena with Glue
- Use Glue to define unstructured data in S3
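Whether the schema comes from a Glue crawler or is written by hand, an Athena table is just a schema laid over S3 objects. A minimal sketch with hypothetical bucket path and column names:

```sql
-- Hypothetical schema over JSON log objects in S3.
CREATE EXTERNAL TABLE access_logs (
    request_time string,
    status       int,
    user_agent   string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/logs/';
```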
Athena Cost Model
- Successful and cancelled queries are billed; failed queries are not
- No charge for DDL
- Save money by using columnar formats
- ORC, Parquet
- And better performance
- Glue and S3 charge separately
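One way to realize the columnar savings is a CTAS query that rewrites existing data as Parquet; the table name and S3 path below are placeholders:

```sql
-- Rewrite a row-oriented table as Parquet; subsequent queries
-- scan fewer bytes and therefore cost less.
CREATE TABLE access_logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/logs-parquet/'
) AS
SELECT * FROM access_logs;
```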
Athena Security
- Access control
- IAM, ACLs, S3 bucket policies
- Cross-account access in S3 bucket policy possible
- Encrypt results at rest in S3 staging directory
- TLS encrypts in transit
Athena Anti-patterns
- Highly formatted reports / visualization
- Use QuickSight
- ETL
- Use Glue
Amazon Redshift
- Fully-managed, petabyte scale data warehouse service
- Designed for OLAP, not OLTP
- SQL, ODBC, JDBC interfaces just like other RDB
- Easily scale up and down manually
- Built-in replication & backups
Redshift Architecture
- Redshift Cluster
- A leader node
- Manage communication with clients & compute nodes
- Receives queries from clients & develops execution plans
- Coordinates the parallel execution of those plans with compute nodes
- Aggregates the intermediate results from compute nodes
- 1~128 compute nodes
- Store user data
- Execute the steps in the execution plans
- Can transmit data among themselves
- Node types
- Dense storage (DS): xl or 8xl
- HDD
- Low cost
- Dense compute (DC): xl or 8xl
- SSD
- Larger memory
- Faster CPUs
- Slices
- Each compute node is divided into slices
- A slice allocates a portion of the memory and disk storage of the node
- Size of slices is determined by the node size of the cluster
Redshift Spectrum
- Query unstructured data in S3 like Redshift table without loading
- Limitless concurrency & horizontal scaling
- Support wide variety of data formats
- Support Gzip and Snappy compression
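Spectrum tables live in an external schema backed by a data catalog; a sketch with hypothetical database, role ARN, and table names:

```sql
-- External schema pointing at a Glue Data Catalog database.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';

-- S3-resident data can then be queried (and joined with local tables).
SELECT COUNT(*) FROM spectrum.sales WHERE sale_date >= '2020-01-01';
```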
Redshift Performance
- Massively Parallel Processing (MPP)
- Columnar Data Storage
- Column Compression
Redshift Durability
- Redshift has 3 copies of data
- An original copy within the cluster
- A backup replica copy within the cluster
- Continuously backed up to S3
- Can furthermore be asynchronously replicated to another region
- Default retention period is 1 day, up to 35 days, 0 to turn off
- Redshift will mirror each drive’s data to other nodes within the cluster if there are 2 or more compute nodes
- Can detect a failed drive or node, and replace it automatically
- Drive failure
- Redshift will remain available but performance may decline
- Node failure
- Redshift will be unavailable during recovery
- The most frequently accessed data is restored from S3 first so you can resume querying as quickly as possible
- Redshift is limited to a single AZ
- You have to restore the data from S3 into a different AZ in the event of an AZ failure
Redshift Scaling
- Vertical and horizontal scaling on demand
- During scaling
- A new cluster is created while the old cluster remains available for reads
- The CNAME is flipped to the new cluster, with a few minutes of downtime
- Data moved in parallel to new compute nodes
Redshift Distribution Styles
- AUTO
- Default style; chooses one of EVEN, KEY, or ALL depending on the size of the data
- EVEN
- Rows distributed across slices in round-robin
- KEY
- Rows distributed based on a column hash
- ALL
- Entire table is copied to every node
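The distribution style is declared at table creation time; the tables and columns below are hypothetical:

```sql
-- KEY: co-locate rows sharing a customer_id on the same slice,
-- which speeds up joins on that column.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL: a small dimension table copied to every node.
CREATE TABLE regions (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
```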
Redshift Sort Keys
- Rows are stored on disk in sorted order based on the column designated as a sort key
- Like an index
- Makes for fast range queries
- Single vs. Compound vs. Interleaved sort keys
- Single
- Compound (default): uses all the designated columns as sort keys, in order
- Performance decreases when queries depend only on secondary sort columns without referencing the primary sort column
- The column order matters
- Improves compression performance
- Interleaved: give equal weight to each column or subset of columns in the sort key
- When multiple queries use different columns for filters
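Sort keys are likewise declared at table creation; a sketch with hypothetical tables:

```sql
-- Compound: sorted by event_date first, then user_id;
-- effective when filters include the leading column.
CREATE TABLE events_compound (
    event_date DATE,
    user_id    BIGINT,
    payload    VARCHAR(256)
)
COMPOUND SORTKEY (event_date, user_id);

-- Interleaved: equal weight to both columns, for workloads
-- that filter on either column independently.
CREATE TABLE events_interleaved (
    event_date DATE,
    user_id    BIGINT,
    payload    VARCHAR(256)
)
INTERLEAVED SORTKEY (event_date, user_id);
```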
Redshift Import / Export Data
- COPY command
- Parallel
- From S3, DynamoDB, or EMR / EC2 / other remote hosts via SSH
- Copy from S3
- Use S3 object prefix or path
- Manifest file
- Authorization
- IAM role based
- Key based
- UNLOAD command
- Efficient way to unload from a table into files in S3
- Enhanced VPC routing
- Forces all COPY / UNLOAD traffic through the VPC rather than over the public Internet
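Typical COPY and UNLOAD invocations, with hypothetical bucket, table, and role names:

```sql
-- Load only the files listed in a manifest, authorized via IAM role.
COPY sales
FROM 's3://example-bucket/load/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
MANIFEST
FORMAT AS CSV;

-- Unload query results back to S3 in parallel (one file per slice).
UNLOAD ('SELECT * FROM sales')
TO 's3://example-bucket/export/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
PARALLEL ON;
```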
Redshift Copy Grant for Cross-region Snapshot Copy
- In the destination AWS region
- Create a KMS key or use an existing one in the destination region
- Specify the KMS key ID for the copy grant in the destination region
- Specify a unique name for the copy grant, and enable cross-region snapshots in the source region
- In the source AWS region
- Enable copying of snapshots to the copy grant
DBLINK
- Connect Redshift to PostgreSQL
- Used to copy and sync data between PostgreSQL and Redshift
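On the PostgreSQL side, the dblink extension can treat Redshift as a remote PostgreSQL server (Redshift speaks the PostgreSQL wire protocol on port 5439); connection details below are placeholders:

```sql
-- Run on the PostgreSQL instance.
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;

SELECT dblink_connect(
    'redshift',
    'host=example-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=awsuser password=example');

-- Query a Redshift table from PostgreSQL.
SELECT *
FROM dblink('redshift', 'SELECT sale_id, amount FROM sales LIMIT 10')
    AS t(sale_id BIGINT, amount DECIMAL(10,2));
```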
Redshift Integration
- S3
- COPY / UNLOAD command
- DynamoDB
- COPY command
- EMR / EC2
- COPY command via SSH
- Data Pipeline
- Database Migration Service (DMS)
Redshift Workload Management (WLM)
- Manage query priorities using query queues
- Avoid short, fast queries getting stuck behind long, slow queries
- Setting by console, CLI, or API
Redshift VACUUM Command
- Cleans up tables
- VACUUM FULL (default)
- Resort rows and reclaim space from deleted rows
- VACUUM DELETE ONLY
- VACUUM SORT ONLY
- VACUUM REINDEX
- Reinitialize Interleaved sort keys, then do VACUUM FULL
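Each variant is invoked per table; `sales` and `events_interleaved` are hypothetical:

```sql
VACUUM FULL sales;                  -- resort rows and reclaim deleted space
VACUUM DELETE ONLY sales;           -- reclaim space without resorting
VACUUM SORT ONLY sales;             -- resort without reclaiming space
VACUUM REINDEX events_interleaved;  -- reinitialize interleaved sort keys, then full vacuum
```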
Redshift Security
- Database can be encrypted using KMS or HSM
Redshift Anti-patterns
- Small data sets
- Use RDS
- OLTP
- Use RDS or DynamoDB
- Unstructured data
- ETL first with EMR or Glue
- Or use Redshift Spectrum
- BLOB data
- Use S3
Amazon RDS
- Hosted relational database service
- Not for big data
ACID
- RDS offers full ACID compliance
- Atomicity
- Consistency
- Isolation
- Durability
Amazon Aurora
- Up to 64TB per database instance
- Up to 15 read replicas
- Can continuous backup to S3
- Can auto scaling with Aurora serverless
Aurora Security
- In VPC
- At-rest with KMS
- In-transit with SSL